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Abstract 

The main performance bottleneck of gravitational A^-body codes is the force cal- 
culation between two particles. We have succeeded in speeding up this pair-wise 
force calculation by factors between two and ten, depending on the code and the 
processor on which the code is run. These speedups were obtained by writing highly 
fine-tuned code for x86_64 microprocessors. Any existing A^-body code, running on 
these chips, can easily incorporate our assembly code programs. 

In the current paper, we present an outline of our overall approach, which we 
illustrate with one specific example: the use of a Hermite scheme for a direct A^^ 
type integration on a single 2.0 GHz Athlon 64 processor, for which we obtain an 
effective performance of 4.05 Gflops, for double precision accuracy. In subsequent 
papers, we will discuss other variations, including the combinations of N log N codes, 
single precision implementations, and performance on other microprocessors. 
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1 Introduction 



Some A^-body simulations can be sped up in various ways, by using faster 
algorithms such as tree codes (iBarnes and Hutl. IToSfih and/or s pecial pur- 
pose hardware such as the GRAPE family (|Sugimoto et al.l . ll99nl ). For some 
regimes, such as low values, these speed-up methods are not very efficient, 
and it would be nice to find other ways to improve the speed of such calcula- 
tions. It would be even better if these alternative ways can be combined with 
other methods of speed-up. 
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We explore here a general approach based on speeding up the inner loop of 
gravitational force calculations, namely the interactions between one pair of 
particles. Also when using tree codes this approach will be useful, since in 
that case the calculation cost is still dominated by force calculations. Even for 
GRAPE applications, this approach will still be useful in many cases, since 
there are always some calculations which are done more efficiently on the front 
end. 

In particular, we consider the optimization of the inner force loop on the 
x86-64 (or AMD64 or EM64T) architecture, the newest incarnation of the 
architecture that originated with the Intel 8080 microprocessor. Processors 
with an x86-64 instruction set are currently the most widely used. Athlon 64 
and Opteron microprocessors from AMD, and many recent models of Pentium 
4 and Xeon microprocessors from Intel, support this instruction set. 

As will be shown in section 2, a straightforward implementation of the inner 
force loop using either Fortran or C, compiled with standard compilers like 
GCC or ice for x86-64 processors, results in a performance that is significantly 
lower than the theoretical peak value one can expect from the hardware. In 
the following, we discuss how we can improve the performance of the force 
loop on processors with an x86-64 instruction set. 



1.1 The SSE2 Instruction Set 



Our approach is based on the use of new features added to the x86 micropro- 
cessors in the last eight years. The first one is the SSE2 instruction set for 
double-precision floating-point arithmetic. Traditionally, the instruction set 
for x86 microprocessors has included the so-called x87 instruction set, which 
was originally designed for the 8087 math coprocessor for the 8086 16-bit mi- 
croprocessor. This instruction set is stack-based, in the sense that it does not 
have any explicit way to specify registers. Instead, registers are indirectly ac- 
cessed as a stack, where two operands for an arithmetic operation are taken 
from the top of the stack (popped) and the result is placed back at the top 
of the stack (pushed) . Memory access also takes place through the top of the 
stack. 

This x87 instruction set had the advantage that the instructions are simple and 
few in number, but for the last fifteen years the design of a fast fioating-point 
unit for this x87 instruction set has been a major problem for all x86 micro- 
processors. If one would really design stack-based hardware, any pipelining 
would be practically impossible. In order to allow pipelining, current x86- 
based microprocessors, from Intel as well as AMD, translate the stack-based 
x87 instructions to RISC-like, presumably standard three-address register-to- 



2 



register instructions in hardware at execution time. 

This approach has given quite high performance, certainly much higher than 
what would have been possible with the original stack-based implementation. 
However, it was clear that pipelining and better use of hardware registers 
would be much easier if one could use an instruction set with explicit reference 
to the registers. In 2001, with the introduction of Pentium 4 microprocessors, 
Intel added such a new floating-point instruction set, which is called SSE2. It is 
still not a real three-address instruction set; it rather uses a two-address form 
where the address of the source register and that of the destination register 
are the same. Moreover, SSE2 still supports operations between the data in 
the main memory and that in the register, as was the case with the IBM 
System/360. Thus, it still has the look and feel of an instruction set from the 
1960s. 



1.2 Minimizing Memory Access 

Even though operations between operands in memory and operands in reg- 
isters are supported, clearly execution would be much faster if all operands 
could reside in registers. However, with the original SSE2 instruction set it was 
difficult to eliminate memory access for intermediate results, because there 
were only eight registers available for SSE2 instructions. For whatever reason, 
these registers are called "XMM" registers in the manufacturer's document, 
and we follow this convention. With the new x86_64 instruction set, the num- 
ber of these "XMM" registers was doubled from 8 to 16. The implication for 
A^-body calculations was that it now became possible to minimize memory 
access during the inner force loop. A form of optimization using this approach 
will be discussed in section 3. 



1.3 Exploiting Two-Word Double-Precision Parallelism 

Another important feature of SSE2 (which is actually Streaming SIMD exten- 
sion 2) is that it is defined to operate on a pair of two 64-bit fioating point 
words, instead of a single floating-point word. This effectively means that the 
use of SSE2 instructions automatically result in the execution of two floating- 
point operations in parallel. While this feature cannot be easily used with any 
compiler-based optimization, it is possible to gain considerable profit from 
this feature through judicious hand coding. We discuss the use of this parallel 
nature of the SSE2 instruction set in section 4. 
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1-4 Exploiting Four-Word Single- Precision Parallelism 



SSE2 is not the only new floating-point instruction set that has been made 
available for the x86 hardware. As the name SSE2 already suggests, there is 
an earlier SSE instruction set, which is similar to SSE2 but works only on 
single-precision floating-point numbers. As is the case with SSE2, SSE also 
works on multiple data in parallel, but instead of the two double-precision 
data of SSE2, SSE works on four single-precision floating-point numbers si- 
multaneously. Thus, the peak calculation speed of SSE is at least a factor of 
two higher than that of SSE2. For those force calculations where single preci- 
sion gives us a sufficient degree of accuracy, we can make use of SSE, gaining 
a performance that is even higher than what would have been possible with 
SSE2, as we discuss in section 5. 



1.5 Utilizing Built-in Inverse Square Root Instructions 



SSE was designed mainly to speed up coordinate transformation in three- 
dimensional graphics. As a result, it has a special instruction for the very fast 
calculation of an approximate inverse square root, which is intended as a good 
initial value for Newton-Raphson iteration. This is exactly what we need for 
the calculation of gravitational forces. We discuss the use of this approximate 
inverse square root for double-precision calculation, in section 4. 



1.6 Organization 



This paper is organized as follows. In section 2, we give the standard C- 
language implementation of the force loop. We consider the code fragment 
which calculates both the acceler ation and its first time der ivative. It is used 
with Hermite integration scheme ( Makino fc Aarseth . 1992| ). 



We present the assembly language output and measured performance, and 
describe possible room for improvements. We call these implementations the 
baseline implementation. 

In section 3, we discuss the optimized C-language implementation of the force 
loop for the Hermite scheme. The difference from the baseline implementation 
is that the C-language code is hand-tuned so that the load and store of inter- 
mediate results are eliminated from the generated assembly-language output. 
This version gives us the speed-up of 46% compared to the baseline method. 

In section 4, we discuss a more efficient use of SSE2 instruction, where forces 
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on two particles are calculated in parallel using the SIMD nature of the SSE2 
instruction set. Here, we use the intrinsic data types defined in GCC, which 
allows us to use this SIMD nature within the syntax of C-language. Also, 
we use the fast approximate square root instruction and a Newton-Raphson 
iteration. This implementation is 88% faster than basehne. 

In section 5, we discuss a mixed SSE-SSE2 implementation of the force calcula- 
tion for the Hermite scheme. In many applications, full double-precision accu- 
racy is not necessary, except for the first subtraction between positions and fi- 
nal accumulation of the forces. "High-accuracy" GRAPE hardwares (GRAPE- 
2, 4 and 6) rely on this mixed-accuracy calculation. Thus, it is possible to 
perform most of the force-calculation operations using SSE single-precision 
instructions, thereby further speeding up the force calculation. In this way, we 
can speed-up the calculation by another 67 % from the SSE2 parallel imple- 
mentation of section 4, achieving 219% speedup (3.19 times faster) from the 
baseline implementation. 



2 Baseline implementation 

2. 1 Target functions 

The target functions that we want to calculate are: 



TJlj 



,(4 +£2)3/2 (^2^^2)5/2 
-(4 + £2)1/^ 



(2) 



X! /^2 , !2M/2 (2) 



where cij and 0i are the gravitational acceleration and the potential of particle 
i, the jerk jj is the time derivative of the acceleration, and r^, Vj, and are 
the position, velocity and mass of particle i, v^j — rj — and — Vj — v^. 

The calculation of a and (p requires 9 multiplications, 10 addition/subtraction 
operations, 1 division and 1 square root calculation. Clearly, calculation of an 
inverse square root is more expensive than addition/multiplication operations. 
Therefore, if we want to measure the speed of a force calculation loop in terms 
of the number of floating-point operations per second, we need to introduce 
some conversion factor for division and square root calculations. In this paper, 
we use 38 as the total number of floating point operations for the calculation 
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of a and 0. This implies that we effectively assign around 10 operations to 
each division and t o each square root c alculation. This particular convention 
was introduced bv IWarren et al.l (|l997l ). and we follow it here since it seems 
to be a reasonable representation of the actual computational costs of force 
calculations on typical scalar processors. 

In addition, it requires 11 multiplications and 11 addition/subtraction oper- 
ations to calculate j. Thus, the total number of floating-point operations per 
inner force loop for the Hermite scheme can be given as 60. This is the number 
that we will use here. Note that we have previous ly used numbers tha t were 
slightly less, by about 5%, using 57 instead of 60 (jMakino et al .1120011 ). 



In the case of a simple leapfrog method or traditional "Aarseth" scheme, we 
need to calculate only and 0,. For a Hermite scheme we need to determine 
jj as well. 



2.2 Baseline implementation and its performance 



The following code fragments contain what we regard as the baseline imple- 
mentation of the force calculations, where we use the word 'force' loosely, to 
indicate the calculations for accelerations, jerks as well as the potential. In 
other words, we consider 'force calculations' to comprise all the basic low-level 
dynamical calculations governing the interactions between pairs of particles. 

List 1. Baseline implementation calculates force on ith. particle. 



#d6fin6 DIM 3 

typedef struct{ 

double X [DIM] ; 

double V [DIM] ; 

double m ; 
}Predictor ; 

typedef Predictor * pPredictor; 

typedef struct{ 

double a[DIM] ; 

double j [DIM] ; 

double pot ; 
}Acc Jerk ; 

typedef AccJerk * pAccJerk; 

void CalcAcc Jerk_Base (pPredictor pr , pAccJerk aj , 

double eps2 , int i, int nbody){ 

int j , k ; 

double r[3], v [3] , acc[3], jerk[3], pot; 

double r2 , rv ; 

double rinv,rinv2; 

double mrinv , nirinv3 , rvrinv2; 

pot = 0.0; 

for(k=0; k<3; k++) acc[k] = jerk[k] = 0.0; 
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Table 1 

Performance of the code shown in list 1 when N = 1024, in cycles-per-interaction 



and Gfiops, on Athlon 64 2.0GHz. 



C/OTTinilpr 




Ontions 


Cvcles 


Gflons 


GCC 3.3.1 


-03 -ffast-math 


-funroll-loops 


94.8 


1.27 


GCC 3.3.1 


-03 -ffast-math 


-funroll-loops -mfpmath=387 


119 


1.01 


GCC 4.0.1 


-03 -ffast-math 


-funroll-loops 


100 


1.20 


PGI 5.1 


-fastsse 




97.6 


1.23 


PathScale 2.0 


-03 -ffast-math 




95.0 


1.26 


ICC 9.0 


-03 




95.8 


1.25 



for(j=0; j<nbody; 

if(j == i) continue; 
for(k=0; k<3; k++){ 

r[k] = pr[j].x[k] - pr[i].x[k]; 

v[k] = pr[j].v[k] - pr[i].v[k]; 

} 

r2 = eps2; 

for(k = 0; k<3; k + + ) r2 += r [k] * r [k] ; 
rv = r [0] * V [0] ; 

for(k=l; k<3; k++) rv += r [k] * v [k] ; 

rinv2 = l./r2; 

rinv = sqrt(rinv2); 

rvrinv2 = 3 . * rv * rinv2;; 

mrinv = rinv * pr[j] .m; 
mrinvS = mrinv * rinv2; 

pot -= mrinv; 

for(k=0; k<3; k++){ 

double temp = r [k] * mrinvS; 
acc [k] += temp; 

jerk[k] += v [k] * mrinv3 - temp * rvrinv2; 

} 

} 



for(k=0; k<3; k++){ 

aj->a[k] = acc[k]; 
aj ->j [k] = jerk [k] ; 
aj->pot = pot; 

} 



Line 29 in list 1 is a branch to avoid self interaction. We might remove this 
branch when we can use softening, but in this case branch prediction of the mi- 
croprocessor works ideally, hence the performance changes little if we remove 
this. 
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Tabic 1 shows the performance of this code on an AMD Athlon 64 3000+ 
(2.0GHz) processor with several different compilers. The first column gives 
the compiler used, the second the compiler options, the third the clock cycles 
per pairwise force calculation, and the fourth column gives the speed in Gflops. 
All compilers generate SSE2 instructions instead of x87 instructions unless we 
explicitly set options to use x87. The performance is fairly good, but not ideal. 
In the following we will investigate how we can improve the performance. 



List 2 is the assembly language output of the Hermite force loop using the 
GCC compiler: 



List 2. Assembly output of list 1 using GCC 3.3.1 with option -S 
- funroll-loops. commented by the authors for clarity. 



-03 -ffast-math 



1 


xorpd 


'/,xmm9 , y.xmmS 




2 


xor 1 


■/.r8d, IrSd 




3 


subq 


$16, y.rsp 




4 


.LCFIO: 






5 


movsd 


y.xmmO , '/.xmmlO 




6 


movl 


"/.edx, "/.rlld 




7 


cmpl 


y, ecx , y.rSd 




8 


movl 


y.ecx, y.rlOd 




9 


movsd 


y.xmm9 , -120('/,rsp) 




10 


movsd 


y,xmm9 , -88 (y.rsp) 




11 


movsd 


y.xmm9 , -112 (y.rsp) 




12 


movsd 


y.xmm9 , -80 (y.rsp) 




13 


movsd 


y.xmm9 , -104 (y.rsp) 




14 


movsd 


y.xmm9 , -72 (y.rsp) 




15 


jge .L41 




16 


movslq 


y.edx , y.rdx 




17 


movlpd 


.LC4(y.rip), y.xmmll 




18 


movlpd 


.LC5(y.rip), y.xmml2 




19 


leaq 


0( ,y.rdx ,8) , y.r9 




20 


subq 


y.rdx , y.rS 




21 


.p2align 4, ,7 




22 


.L60: 






23 


cmpl 


y.rlld, y.r8d 




24 


je .L9 




# 


25 


movslq 


y.r8d , y.rcx 




26 


leaq 


( , y.rcx , 8) , y.rdx 




27 


subq 


y.rcx , y.rdx 


# 


28 


leaq 


8( ,y.r9 ,8) , y.rcx 




29 


movlpd 


(y.rdi , y.rdx , 8) , y.xmmS 


# 


30 


leaq 


8( , y.rdx ,8) , y.rax 




31 


subsd 


(y.rdi , y.r9 , 8) , y,xmm6 


# 


32 


movsd 


y.xmm6 , -24 (y.rsp) 


# 


33 


movsd 


y.xmm6 , y.xmmS 




34 


movlpd 


24 (y.rdi , y.rdx , 8) , y,xmm4 




35 


subsd 


24 (y.rdi , y.r9 , 8) , y.xmm4 


# 


36 


mulsd 


y.xmmS , y.xmmS 


# 


37 


addsd 


y.xmmlO , y,xmm3 


# 


38 


movsd 


y.xmm4 , -56 (y.rsp) 




39 


movlpd 


(y.rdi , y.rax) , y,xmm7 




40 


subsd 


(y.rdi , y.rcx) , y.xmm7 


# 


41 


movsd 


y.xmm7 , -16(y,rsp) 




42 


movsd 


y.xmm7 , y.xmmlS 




43 


movsd 


y.xmm7 , y,xmm2 




44 


movlpd 


24 (y.rdi , y.rax) , y.xmml4 




45 


leaq 


16 ( , y.rdx ,8) , y.rax 





continue 



y.rdx = 7 



y.rcx 



y.xmm6 = pr[j].x[0] 

y,xmm6 -= pr[i].x[0] 
*(y.rsp-24) = y.xmm6 



y,xmm4 
y.xmmS 
y.xmmS 



= pr [j] . v[0] 
= x*x 
+= eps2 



pr [i] . v[0] 



y,xmm7 = pr[j].x[l] - pr[i].x[l] 
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46 
47 
48 
49 
50 
51 
52 
53 
54 
55 
56 
57 
58 
59 
60 
61 
62 
63 
64 
65 
66 
67 
68 
69 
70 
71 
72 
73 
74 
75 
76 
77 
78 
79 
80 
81 
82 
83 
84 
85 
86 
87 
88 
89 
90 
91 
92 
93 
94 
95 
96 
97 
98 
99 
100 
101 
102 
103 
104 
105 
106 
107 
108 



L9 : 



L41 



sub s d. 


24(yerdi ,%rcx) , 5'«xmml4 


# 


'/.xmm 14 


= pr [j] . V 


mu 1 s d 


%xinm7 /oXmml5 










16 ( , %r9 , 8) , y.rcx 








adds d 


^xmml5 , %xmm3 


# 


ZxmmS 


+ = y*y 


m mr n n H 

ill \J V ^ L/ Ll 


Kj\J\/^^i^LfJ f /9 ^ ill ill X \J 




load acc [0] 


ino V sd 


%xmml4 , — 48(5Crsp) 








IDUl s d 


/sXmml4 , %xmm2 








m V 1 p d 


(/ordi /»ra.x) /^xmmS 








s Tj.b s d 


C^i'di ^rcx) ^xmmS 




% xmm8 


= pr [ j ] . X [ 


rn. V s d 


^xmrnS /^xmm5 








mo V s d 


^xmm8 , ~8(%rsp) 








Tn mr ^ rf 

ill V V O nJ 


%xmm8 ^xmmO 








mo vlpd 


24(%rdi , yCrax) , >Cxmml3 








mul s d 


/gXmmS f Xxmm5 








o H K ej H 

O Ipl !■/ M Ll 


24(%rdi %rcx) %xmml3 




% xmm 1 3 


= T5T~ r i 1 V 


adds d 


%xmm5 , %xmm3 


# 


% xmm3 


+ = z*z 


m V s d 


%xmm6 %xmm5 








mu 1 s d 


%xmm4 /^xmm5 


# 


% xmm5 


= X * vx 


in V s d 


y, xmmlS , -40(%rsp) 








mill ^ {\ 

ill LL ^ U 


% xmm 13 %xmm 








add s d 


/oXmm2 J %xmm5 


# 


%xmm5 




ill w V O 


% xmm 11 J Xxmm2 




%xmm2 


= 1.0 


d i V s d 


/sXmm3 • %xmm2 


# 


% xmm2 


= 1 . 0/r2 


m V 1 p d 


-80(yrsp) , yxmm3 








add s d 


yxmmO , /flXminS 


# 


/o xmm 5 


+ = Z * V z 


s qr t s d 


%xinm2 %xiiiiiil 


# 


/o xmm 1 


= 1 . 0/r 


muX s d 


^xmm2 , %xmm& 


# 


% xmm5 


= r V / r 2 


mil n e f1 

ill ui -1- o 


48fVrdi '/rdx 8^ V^cmml 

fVif\/o^V^^y/oXu^jV^^ y /9 A ill ill X 




Xxmm 1 


*= Tarfil tn 


mul s d 


/oXmml2 f %xmm5 


# 


%xmm5 


= 3rv/r2 


muX s d 


%xmml , %xmm2 


# 


%xmm2 


= m/r3 


s ub s d 


yxmml %xmm9 




pot ~— 


m / r 


m V X p d 


~T2(y3rsp) , ^xmml 








mil X s d 


yxmm2 /oXmm6 








mu X s d 


yxmm2 %xin.mT 








mu X s d 


yxmm2 %xmm8 








mill ^ 

ill LL ^ O U 


/gXmm2 j'^xmm4 








n H H ^ H 


%xmm6 J Xxmml5 




acc [0] 


+= X [0] * 


mil n ^ H 

ill U -1- O VI 


/oXmm2 J Xxmml4 








adds d 


%xmm7 , >'^xmm3 








mu X s d 


yxmmB %xmm6 








ad d s d 


%xmm8 %xmml 








muX s d 


%xmm5 • %xmmT 








m 11 n e H 

ill U ^ O Ll 


/gXmm2 f yCxmml3 








muX s d 


%xmm5 , %xmm8 








mov s d 


%xmml5 , -88(%rsp) 


# 


store 


acc [0] 


s ub s d 


%xmm6 %xmm4 








m V s d 


yxmmS , -80 (yrsp) 


# 


store 


acc [ 1 ] 


s ub s d 


%xinm7 /oXin.ml4 








addsd 


-120(yorsp) , yxmm4 








mo vsd 


yxmml , -72(yrsp) 


# 


store 


acc [2] 




-112(70 rsp) , y»xmml4 








sub s d 


%xmm8 • >Cxmml3 








addsd 


-104C/.rsp) , y.xmmlS 








movsd 


%xmm4 , -120(%rsp) 


# 


store 


jerk[0] 


movsd 


"/.xmml4, -112(/',rsp) 


# 


store 


jerkLl] 


movsd 


%xmml3 , -104(/',rsp) 


# 


store 


jerk[2] 


incX 


•/.r8d 








cmpX 


"/.rlOd, •/.r8d 








jX .L60 


# 


if (++j 


< nbody) 


movq 


-88("/.rsp), 7.rll 








movq 


-120C/.rsp) , "/.rlO 









pr [i] . v[2] 
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109 
110 
111 
112 
113 
114 
115 
116 
117 
118 
119 
120 
121 



movq 
movq 
movq 
movq 
movq 
movq 
movq 
movq 
movq 
movq 



movsd 



•/.xmm9 , 48C/.rsi) 
-80('/.rsp) , '/.rg 
-112(y.rsp) , '/.rS 
-72C/.rsp), Irdi 



-104C/.rsp) , y.rdx 
"/.rll, ('/.rsi) 



"/.rlO, 24(%rsi) 

•/.r9, 8C/.rsl) 



addq 
ret 



•/.rS, 32('/.rsi) 
y.rdi , 16(y.rsi) 
y.rdx, 40(y.rsi) 
$16, y.rsp 



One can see that there are quite a few unnecessary load/store instructions. For 
example, the store instructions in lines 32, 38, 41, 51, 56 and 64 of list 2 serve 
no purpose. Also Arrays acc [3] and jerk [3] could be placed in registers like 
variable pot, but they are placed in memory instead. 



3 C-level optimization 

3.1 Code modification 

As we have seen in the previous section, the assembly language output shows 
a significant amount of unnecessary load/store instructions. In principle, if the 
compilers were clever enough they would be able to eliminate these unneces- 
sary operations. In practice, we need to guide the compilers so that they do 
not generate unnecessary code. 

We have achieved significant speedup using the following two guiding princi- 
ples: 

(1) Eliminate assignment to any array element in the force loop. 

(2) Reuse variables as much as possible in order to minimize the number of 
registers used. 

Apparently, present-day compilers are not clever enough to eliminate load/s- 
tore operations, if elements of arrays are used as left-hand-side values. There- 
fore, we hand-unroll all loops of length three and use scalar variables instead 
of arrays for three-dimensional vectors. 

In well-written programs, variables which contain the values for different phys- 
ical quantities should have different names. However, the use of many different 
names prevents optimization, since it results in a number of variables too large 
to be fitted into the register set of SSE2 instructions (there are 16 "XMM" 



10 



registers for SSE2 instructions). Therefore, we explicitly reuse variables such 
as X, y, z so that these can hold both the position difference components as 
well as the pairwise force components, at different times in the computation. 



The resulted C-language code is the following: 



List 3. An optimized force loop. 



typedef struct-C 

double X [3] ; 

double V [3] ; 

double m; 

double pad; 
}Predictor ; 
typedef Predictor 



* pPredictor; 



void CalcAcc Jerk_Fast (pPredictor pr , pAccJerk aj , 

double eps2 , int i, int nj){ 

int j ; 

double X , y , z , vx , vy , vz ; 
double ax.ay.az, jx, jy, jz.pot; 
double r2,rv; 
double rinv , rinv2 , r inv3 ; 

pPredictor jpr = pr ; 
pPredictor ipr = pr + i; 

pot=ax=ay=az=jx=jy=jz=0.0; 



for(j=0; j<nj ; jpr++){ 

__builtin_pref etch ( jpr+2 , 
if(j == i) continue; 
X = jpr->x[0] - ipr->x[0] 
y = jpr->x[l] - ipr->x[l] 
z = jpr->x[2] - ipr->x[2] 

vx= jpr->v[0] - ipr->v[0] 
vy= jpr->v[l] - ipr->v[l] 
vz= jpr->v[2] - ipr->v[2] 



3); 



r2 



eps2 + x*x + y*y + z*z; 



rv = vx*x + vy*y + vz*z; 

rinv2 = l./r2; 
rinv = sqrt(rinv2); 
rv *= rinv2*3 . ; 
rinv *= jpr->m; 
rinv3 = rinv * rinv2 ; 

pot -= rinv; 

X *= rinv3; ax += x; 
y *= rinv3; ay += y; 
z *= rinv3; az += z; 

vx *= rinv3 ; jx += vx ; 
vy *= rinv3 ; jy += vy; 
vz *= rinv3; jz += vz ; 



*= rv ; J X 
*= rv; jy 
*= rv ; jz 



x; 

y ; 
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55 


} 






56 








57 


aj 


->a[0] 


= ax 


58 


aj 


->a[l] 


= ay 


59 


aj 


->a[2] 


= az 


60 


aj 


->j [0] 


= jx 


61 


aj 


->j [1] 


= jy 


62 


aj 


->j [2] 




63 


aj 


->pot = 


pot 


64 


> 







This code is compiled into the following assembly code using the GCC 3.3.1 
compiler (flag -03 -f fast-math): 

List 4. Assembly output of list 3. 



1 


CalcAcc Jerk 


_Fast : 


2 


. LFB3 : 




3 


movslq 


y.edx , y.rS 


4 


xorpd 


%xmm9 , 7,xmm9 


5 


salq 


$6, "/.rS 


6 


movsd 


y.xmmO , -8('/,rsp) 


7 


1 e aq 


C/.rS ,y.rdi) , Zrax 


8 


movsd 


yoXmm9 , yxmmll 


9 


movsd 


XxmrnQ , XxmrnlO 


10 


xor 1 


y.r8d , y.r8d 


11 


movsd 


y,xmm9 , y,xmml4 


12 


movsd 


y,xmm9 , y,xmml3 


13 


cmpl 


y.ecx , y.rSd 


14 


movsd 


y,xmm9 , y,xmml2 


15 


movsd 


y,xmm9 , y,xmml5 


16 


jge .L9 




17 


. p2alig 


n 4 , ,7 


18 


.Lll : 




19 


cmpl 


y.edx , y.rSd 


20 


prefetchtO 128(/'.rdi) 


21 


je .L4 




22 


movlpd 


(y.rdi ) , y,xmm3 


23 


movlpd 


8 (y.rdi) , y,xmm4 


24 


subsd 


(y.rax) , y.xmmS 


25 


movlpd 


16 (y.rdi) , y.xmmS 


26 


subsd 


8(y.rax), y,xmm4 


27 


movlpd 


24 (y.rdi) , y.xmmS 


28 


subsd 


16(y,rax), y.xmmS 


29 


movlpd 


32 (y.rdi) , y.xmm7 


30 


subsd 


24 (y.rax) , y.xmmS 


31 


movlpd 


40 (y.rdi) , y.xmmS 


32 


subsd 


32 (y.rax) , y.xmm7 


33 


subsd 


40 (y.rax) , y.xmmS 


34 


movsd 


y.xmmS , y.xmmO 


35 


movsd 


y,xmm4 , y,xmm2 


36 


mulsd 


y.xmmS , y.xmmO 


37 


addsd 


-8(y,rsp), y.xmmO 


38 


mulsd 


y.xmm4 , y,xmm2 


39 


movsd 


y.xmm7 , %xmml 


40 


mulsd 


yoXmm4 , y.xmml 


41 


addsd 


y.xmm2 , y.xmmO 


42 


movsd 


y.xmmS , y,xmm2 


43 


mulsd 


y.xmmS , y,xmm2 


44 


addsd 


y,xmm2 , y.xmmO 


45 


movsd 


y.xmmS , y,xmm2 
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46 


mil 1 s d. 




%xmm2 


47 


cLcid. s d. 


y.xmml , 


yoXmm2 


48 


in V s d. 


y.xmmS , 


y.xmml 


49 


muX s d 


%xmm5 , 


y.xmml 


50 


a H H ^ rt 


y.xmml , 


y,xmm2 


51 


Tn mr n n H 

111 V V ^L/U 


.LC4(%rip), y.xmml 


52 


d i V s d 


y.xmmO , 


y.xmml 


53 


s t s d 


y.xmml , 


%xmmO 


54 


mil 1 s d 


y.xmml , 


y.xmm2 


55 


mu 1 s d 


48 (y.rdi ) , y.xmmO 


56 


muX s d 


.LC5(y.rip), y.xmm2 


57 


m 11 n ^ f1 

111 U -1- O V.1 


%xmmO . 


Xxmm 1 


58 


subs d 


%xmmO . 


%xmm 1 S 


59 


m 11 n ^ H 

HI U ^ M LI 


%xmml 


% xmmS 


60 


muX s d 


y xmm 1 


xmm4 


61 


niuX s d 


y.xmml , 


y.xmmS 


62 


mu X s d 


y.xmml , 


y,xmm6 


63 


mu X s d 


y.xmml , 


y.xmm7 


64 


cid d s d 


y.xmmS , 


y.xmml 2 


65 


mill ^ d 

ill LL ^ i3 U 


y.xmml , 


y.xmmS 


66 


3.dd s d 


y.xmm4 , 


y.xmml 3 


67 


a d d ^ d 

CLLl Ll O Ll 


y.xmmS , 


y.xmml 4 


68 


mu X s d 


y.xmm2 , 


y.xmmS 


69 


3. d d s d 


y.xmmG , 


y.xmmlO 


70 


mu X s d 


y.xmm2 , 


yflXmm4 


71 


add s d 


y.xmm? , 


y.xmml 1 


72 


muX s d 


y.xmm2 , 


y.xmmS 


73 


3.dd s d 


y.xmmS , 


y.xmmS 


74 


subsd 


y.xmmS , 


y.xmmlO 


75 


subsd 


y.xmm4 , 


y.xmml 1 


76 


subsd 


y.xmmS , 


y,xmm9 


77 


.L4: 






78 


inc X 


y.rSd 




79 


addq 


$64 , y.rdi 


80 


cmpX 


y.ecx , y.rSd 


81 


jl .Lll 




82 


.L9: 






83 


movsd 


y.xmml 2 , 


(y.rsi) 


84 


movsd 


y.xmml 3 , 


S(y.rsi) 


85 


movsd 


y.xmml 4 , 


16(y.rsi) 


86 


movsd 


y.xmmlO , 


24(y.rsi) 


87 


movsd 


y. xmm 1 1 , 


S2(y.rsi) 


88 


movsd 


y.xmm9 , 


40(y.rsi) 


89 


movsd 


y. xmm IS , 


4S(y.rsl) 


90 


ret 







Table 2 

Performance of list 3 when N = 1024, in cycles-per-interaction and Gflops, on 
Athlon 64 3000+ 2.0GHz. 



Compiler 




Options 


Cycles 


Gflops 


GCC 3.3.1 


-03 


-f fast-math 


64.8 


1.85 


GCC 4.0.1 


-03 


-f fast-math 


64.7 


1.85 


ICC 9.0 


-03 




79.4 


1.51 



We can see that now all load/store instructions of the intermediate results 
are eliminated. Figure 1 shows the use of registers and table 2 presents the 
performance data for this code. This version is 46% faster than the code in 
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/oxiniriu 


rinVj r2 


%xiniiil 


iriiiv2 , rinvS , y*vy, z*vz 


%xnini2 


^v, y*y, z*z 


/OXillillO 


X 


%xiiiiii4 


y 


%xinin5 


z 


%xinin6 


vx 


/oxmin / 




/oXiiiilio 


VZ 


%xnini9 




%xiiiiiilO 




/OXlIllIll 1 


jy 


1 11 1 11 1 

/( .A .LU.J.Ii_l_Z 


clX 


%xminl3 


ay 


%xminl4 


az 


%xmml5 


pot 



Fig. 1. The use of XMM registers in list 4. 



section 2. Very roughly, this speed-up is consistent with the reduction in the 
number of assembly language instructions (from 82 to 62), but is somewhat 
larger. This is partly because the instructions which take memory arguments 
have a larger latency than register-only instructions. 



3.2 Prefetch insertion and alignment 

Line 22 of list 3 shows a built-in function for GCC, which is compiled into 

a prefetch instruction. In this case, the prefetch instruction loads the data 
which will be needed two iterations after the current one. The second and 
third parameters are read/ write flags for which zero indicates preparing for a 
read and the degree of temporal locality takes value from 1 to 3. 

Line 6 of list 3 shows a pad to make the size of the predictor structure to 
be exactly 64-byte, which is the cache-line size of Athlon 64 processor. To 
make the predictor structure align at 64-byte boundary, we use memalignO 
or posix_memalign() instead of mallocO. 
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74 - 



■A 



pf, align ------- 

pf, no align ---A--- 

no pf, align — ■ — 
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/ 




■-Q -{Zf 



64 



J ■ ■ ■ I 
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10000 



N 



Fig. 2. Clock cycles per force calculation loop as a function of A'^ on Athlon 64 
3000+ 2.0 GHz. Open squares and triangles show the results of using code with 
prefetch and with and without 64- byte alignment. Filled squares and triangles show 
the result of using code without prefetch and with and without 64-byte alignment. 

Figure 2 shows the performance of the force loop as the function of the loop 
length A^, with and without this prefetch instruction and 64-byte alignment. 
The fastest case depends on N. For the region N < 1024, in which data fits 
into 64 kB of LI data cache, the code with prefetch and 64-byte alignment is 
the best. For larger N, the code with prefetch and without ahgnment is the 
best. 



4 Assembly-level optimization 

In this section, we give two extensions specialized for x86/x86_64 architecture 
for the force loop. The first is using SSE2 vector (SIMD) mode instead of SSE2 
scalar (SISD) mode. The second is replacing one division and one square root, 
by a special instruction for fast approximate inverse square root in SSE and 
Newton-Raphson iteration. 

4-1 SSE2 vector mode 

In the previous section we improved the performance of the C-language im- 
plementation of the force loop, essentially by hand-optimizing the C code so 
that the generated assembly code becomes optimal. However, in this way, we 
did not use the full capability of SSE2. While SSE2 instructions can process 
two double-precision numbers in parallel, the force loop discussed in the pre- 
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vious section uses only one of these two words. Clearly, we did not yet use the 
"SIMD" nature of the instruction set. 

Whether or not we can gain by using this SIMD nature depends on the partic- 
ular code, and also on the particular processor. On Intel P4 architecture the 
SIMD mode can offer up to a factor of two speedup, while on AMD Athlon 
64 or Intel Pentium M, the speed increase can be small (or even negative). 

There are many ways to use this SIMD feature. Since there are similarities 
with the vector instructions of some old vector processors, in particular the 
Cyber 205, one could make use of an automatic vectorizing compiler (such as 
ice version 6.0 and later). However, just as was the case with the old vectorizing 
compilers, the vectorizing capability of modern compilers are still very limited, 
and it is hard to rewrite the force loop so that the compiler can make use of 
the SIMD capabihty. 

In fact, part of the reason why vectorization is difficult is that the load/store 
capability of SSE2 instructions is rather limited: it can work efficiently only 
on a pair of two double-precision words in consecutive and 16-byte aligned 
addresses. This means that the basic loop structure cannot work, and one 
needs to copy the data of two particles into some special data structure, in 
order to let the compiler generate the appropriate SIMD instructions. 

Here we have adopted a low-level approach, in which we make use of the spe- 
cial data type of pairs of double precision words defined in GCC. Basically, 
this data type, which we call v2df , corresponds to what is in one XMM reg- 
ister (a pair of double-precision words). We can perform the usual arithmetic 
operations and even function calls for this data type. Here is the code which 
makes use of this v2df data type: 

List 5. Defining SSE/SSE2 data type. 

1 typedef double v2df attribute.. ( (mode ( V2DF) ) ) ; 

2 typedef float v4sf attribute ( (mode (V4SF) ) ) ; 



The v4sf data type packs four single-precision words for SSE which we will 
use later. 

The basic idea here is to calculate the forces from one particle on two different 
particles in parallel. One could try to calculate the forces from, rather than 
on, two different particles on one particle in parallel, but that would result 
in a more complicated program since then we will need to add up the two 
partial forces in the end. Also, from the point of view of memory bandwidth, 
our approach is more efficient, since we need to load only one particle per 
iteration. Note that this is the same approach as what is called i-parallelism 
in the various versions of GRAPE hardware, where parallel pipelines calculate 
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the forces on different particles from the same set of particles. 

In this code, we use macros for gathering load and scattering store using 
instructions for loading/storing the higher/lower half of an XMM register: 



List 6. Gathering load and scattering store for SSE2. 



1 


#d6fine V2DF_GATHER (reg , ptrO , ptrl) \ 




2 


reg = __bulltin_la32_loadlpd (reg , (void 


*)(ptrO)); \ 


3 
4 


reg = __builtin_ia32_loadhpd (reg , (void 


*) (ptrl)) ; 


5 


#define V2DF_SCATTER (reg , ptrO , ptrl) \ 




6 


__builtin_ia32_storelpd((void *)(ptrO) , 


reg) ; \ 


7 


__builtin_ia32_storehpd ((void *)(ptrl), 


reg) ; 



Wc should be careful about the fact that these built-in functions arc undocu- 
mented feature of GCC The GCC document describes built-in functions for 
most SSE/SSE3 instructions but not for any of SSE2 instruction. We can 
call most SSE2 instruction through the prefix __builtin_ia32_. Note that, for 
whatever reason, instructions such as "movxxx" from/to memory are changed 
to " loadxxx" /" storexxx" . But this does not mean that GCC always generates 
correct assembly output. For example, GCC generates wrong code for the first 
macro in list 6 when reg is not a register variable. This can be considered ei- 
ther a bug or a feature, and it shows the risk of using these undocumented 
aspects of GCC. 

We also use macros for numerical values since we cannot use numerical values 
like "3.0" for vector operations: 



List 7. Macros for numerical values. 



1 


#define V2DF_0P_SCALAR (reg , 


op, imm) 


{ \ 


2 


v2df v2df_temp = {imm. 


imm}; \ 




3 


reg op v2df_temp; \ 






4 


} 






5 


#define V2DF_0P_VECTQR (reg , 


op , immO , 


imml) { \ 


6 


v2df v2df_temp = {immO, 


imml}; \ 




7 

8 


reg op v2df_temp; \ 

} 






9 


#define V4SF_0P_SCALAR (reg , 


op, imm) 


{ \ 


10 


v4sf v4sf_temp = {imm. 


imm , imm , 


imm}; \ 


11 


reg op v4sf_temp; \ 






12 


} 







Note that we need to copy the data of a single particle into both high and low 

words of an XMM register. The new SSE3 extension supports this "broad- 
casting", while the original SSE2 set did not. Therefore, we use the following 
macro: 

List 8. Broadcast loading for double precision. 

1 #ifdef __SSE3__ 

2 #define LOADDDUP (reg , ptr) \ 
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reg = __builtin_ia32_loadddup (ptr) ; 

#else 

#deflne LOADDDUP (reg , ptr) \ 

reg = __builtin_ia32_loadlpd (reg , (void *)(ptr)); \ 
reg = __builtin_ia32_loadhpd (reg , (void *)(ptr)); 



We show the vectorized force loop using these data types and macros in hst 
9. 



List 9. Vectorized force loop using SSE2 vector mode. 



typedef struct{ 

double X [3] ; 

double V [3] ; 

double m [2] ; 
}Predictor ; 

typedef Predictor * pPredictor; 

void CalcAcc Jerk_Vector (pPredictor pr , pAccJerk AccJerkOut , 

double eps2 , int index [] , int nj){ 

int j, k; 

v2df x,y,z,vx,vy,vz,ax,ay,az,jx,jy,jz,pot; 

v2df r2,rv; 

v2df r invl , r inv2 , r inv3 ; 

static v2df zero = {0.0, 0.0}; 

pPredictor prj = pr; 

int iO = index [0] ; 

int il = index[l] ; 

v2df xi [7] ; 

v2df *vi = xi+3; 

v2df *eps2p = xi+6; 

pot =ax=ay =az= j x= j y= j z = zero; 

for (k=0;k<3;k++){ 

V2DF_0P_VECT0R(xi [k] , =, pr[iO].x[k], pr [i 1] . x [k] ) ; 
V2DF_0P_VECT0R(vi [k] , =, pr[iO].v[k], pr [i 1] . v [k] ) ; 

} 

V2DF_DP_SCALAR (*eps2p , =, eps2); 
for(j=0; j<nj ; j++,prj++){ 

LOADDDUP (x, prj->x); 

X -= xi [0] ; 

LOADDDUP (y, prj->x+l); 
y -= xi [1] ; 

LOADDDUP (z, prj->x+2); 

z -= xi [2] ; 

LOADDDUP (vx, prj->v); 

vx -= vi [0] ; 

LDADDDUP(vy, prj->v+l); 

vy -= vi [1] ; 

LOADDDUP (vz, prj->v+2); 

vz -= vi [2] ; 

r2 = x*x + y*y + z*z + *eps2p; 

rv = vx*x + vy*y + vz*z; 
__builtin_pref etch (pr j +2 , 0, 3); 

V2DF_0P_SCALAR (rinv2 , =, 1.0); 
rinv2 /= r2; 

rinvl = __builtin_ia32_sqrtpd (rinv2) ; 
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} 



rv *= rinv2 ; 
V2DF_0P_SCALAR (rv , *= , 



3.0) 



rinvl *= *(v2clf *)(prj->m); 

pot -= rinvl; 

rinv3 = rinvl * rinv2; 



58 




X * 


= r inv3 ; ax += x ; 


59 




y * 


= rinv3; ay += y; 


60 




z * 


= rinv3; az += z; 


61 








62 




vx 


*= rinv3; jx += vx ; 


63 




vy 


*= rinv3; jy += vy ; 


64 




vz 


*= rinv3; jz += vz ; 


65 








66 




X 


*= rv; jx -= x; 


67 




y 


*= rv; jy -= y; 


68 




z 


*= rv; jz -= z; 


69 


} 






70 








71 


{ 


/* 


correct self interaction 



v2df mass = {pr [iO] . m [0] , pr [i 1] . m [0] } ; 
V2DF_0P_SCALAR (mass , *= , 1 . 0/ sqrt ( eps2 ) ) ; 
pot += mass; 



V2DF_SCATTER(ax, AccJerkOut[0] .a, 

V2DF_SCATTER (ay , Acc JerkOut [0] . a+1 , 

V2DF_SCATTER (az , Acc JerkOut [0] . a+2 , 

V2DF_SCATTER ( jx , Acc JerkOut [0] . j , 

V2DF_SCATTER ( jy , Acc JerkOut [0] . j+1 , 

V2DF .SCATTER (jz , AccJerkDut[0].j+2, 



AccJerkOut [1] .a) ; 
Acc JerkOut [1] . a+1) ; 
AccJerkOut [1] . a+2) ; 
AccJerkOut [1] .j) ; 
AccJerkOut [1] .j + 1) ; 
AccJerkOut [1] . j+2) ; 
V2DF_SCATTER (pot , feAcc JerkOut [0] . pot , &Acc JerkOut [1] . pot ) : 



The predictor structure has changed to store the same mass value in two places 
(line 4), in order to save an instruction (line 54). 



4-2 Fast approximate inverse square root 



The most expensive part of the force calculation, for recent microprocessors, 
is the calculation of an inverse square root which requires one division and one 
square root caluclations. Several attempts to speed this up using table lookup, 
polynomial approximation and Newton-Raphson iteration have been reported 
(e.g. Karp 1992). The main difficulty in these approaches is how to get a good 
approximation for the starting value of a Newton-Raphson iteration quickly. 

Here, we use RSQRTSS/PS instruction in the SSE instruction set, which pro- 
vide approximate values for the inverse square root for scalar /vector single- 
precision floating-point numbers, to an accuracy of about 12 bits. With one 
Newton-Raphson iteration, wc can obtain 24-bit accuracy, which is sufficient 
for most "high-accuracy" calculations. If higher accuracy is really necessary, 
we could apply a second iteration. The Newton-Raphson iteration formula is 
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expressed as: 



xi^ -^xo{axl-3). (4) 
Here, xo is an initial guess for 1 / y/a. 

We show the implementation for a scalar version using inline assembly code 
with GCC extensions: 

List 10. Calling rsqrtss through inline assembly code. 



double r2 , rinv; 
float ftmp; 

ftmp = (float)r2; 

asmC'rsqrtsSu'/.l ,u'/.0" : "=x" (ftmp) : "x" (ftmp)) ; 

rinv = (double ) f tmp ; 

rinv *= r2*rinv*rinv - 3.0; 



and for a vector version using built-in functions: 

List 11. Calling rsqrtps using built-in functions of GCC. 



1 


v2df 


r2 , rinv ; 


2 

3 


v4sf 


ftmp ; 


4 


f tmp 


= __built in_ ia32_c vtpd2ps (r 2 ) ; 


5 


ftmp 


= __builtin_ia32_rsqrtps (ftmp) ; 


6 


rinv 


= __builtin_ia32_cvtps2pd (f tmp) ; 


7 


r2 * 


= rinv; 


8 


r2 * 


= rinv; 


9 


V2DF 


_0P_SCALAR(r2, -= , 3.0); 


10 


rinv 


*= r2 ; 


Note that 


we skip here the multiplication by 



■1/2 in equation (4). This can 
be done after the total force is obtained. 

To use RSQRTSS/PS instructions, we first convert into single precision, 
then apply these instructions, and finally convert the result back to double 
precision. 

Note that the actual value returned by this RSQRTSS/PS instruction is im- 
plementation dependent. In particular, the AMD Athlon 64 processors and the 
Intel Pentium 4 processors return different values. Figure 3 shows the errors 
of the return values as a function of the input values. The AMD implementa- 
tion has smaller average errors, but at the same time they show a relatively 
large systematic bias. Even after one Newton-Raphson iteration in double 
precision, results from both implementations show relatively large biases. Ta- 
ble 3 presents the root mean square error, max error, and bias (mean error) 
of the approximate value 1/^/x, on a Intel Pentium 4 and an AMD Athlon 
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Fig. 3. Relative error of the return value of RSQRTSS/PS instructions on AMD 
Athlon 64 and Intel Pentium 4 for 1 < x < 16 (upper) and 1.5 < a; < 1.55 (lower). 
The errors is periodic for powers of 4. 



64 before/after one Newton- Raphson iteration. Here the error is measured as 
l/2(a;[rsqrt(a;)]^ — 1), and the weight is l/frac(a;), where frac(x) is the normal- 
ized fraction of the floating-point number x. We can correct these biases, at 
least statistically, by multiplying the resulted force by constants which depend 
on the processor type used. 



4-3 Performance 



Table 4 shows the performance of four different types of force loops, using 
SSE2 scalar/vector mode and with/without fast inverse square root. 
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Tabic 3 



Errors of approximate inverse square root. 








RMSE MAX error 


Bias 




AMD, before N-R 


7.96 X 10-5 2.59 x 10"^ 


2.21 X 10-5 




AMD, after N-R 


1.57 X 10-^ 1.00 X 10-'^ - 


-9.51 X lO-'^ 




Intel, before N-R 


1.16 X 10-^ 3.26 X 10-^ - 


-8.37 X 10-^ 




Intel, after N-R 


3.05 X 10-^ 1.60 X 10-'^ - 


-2.01 X 10-^ 




IEEE 754 Single 


3.58 X 10-^ 8.94 x 10"^ 


1.52 X 10"^^ 




Tabic 4 

Performance of SSE2 scalar mode and vector mode. 




SSE2 mode 


scalar 


vector 


sqrt operation 


normal fast 


normal 


fast 


Cycles per interaction 64.8 cycle 70.5 cycle 


69.0 cycle 


50.0 cycle 


Calculation speed 


1.85 Gflops 1.70 Gflops 


1.73 Gflops 


2.40 Gflops 



5 Mixed-precision force loop 



As we have summarized in the introduction, SSE2 is the SIMD instruction set 
for pairs of double-precision words. There are also SSE instructions that work 
on quadruples of single-precision words. Thus, using the SSE instruction set, 
we can in principle double the performance. If we perform the initial subtrac- 
tion between positions and final accumulation of acceleration and potential in 
double-precision, we can use single-precision SSE functions for all other cal- 
culations, including subtraction between velocities and accumulation of jerk, 
and still maintain a pretty high accuracy. The main complexity here is the 
question of how to make use of four elements of SSE data type, in parallel. 
The simplest approach is to calculate the forces on four particles, but in that 
case we need too many variables, which do not all fit into the registers. We 
need two XMM registers for each element of force in this way. Instead, we 
have tried to achieve the maximum speed by calculating the forces from two 
particles on two particles (making a total of four force calculations) in parallel 
in SSE. 

In the following, we present a detailed description of the implementation of 
this mixed-precision force loop using SSE/SSE2. We use the term i particles 
for particles which feel the gravitational force, and j particles for particles 
which exert the force. 
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5. 1 The data structure for j particles 

The data structure for the j particles should contain two particles. The code 
in hst 12 achieves this goal by using the data types v2df and v4sf . Note 
that for velocities we use the single-precision v4sf data type, since wc do not 
need double-precision accuracy for the velocity. We store the data of the two 
particles (with indices j andj+1) as + for position, ands {j, j, j + 
for velocity and mass. 

List 12. The ?" particle structure. 

1 typedef struct{ 

2 v2df X [3] ; 

3 v4sf v[3] ; 

4 v4sf m; 

5 v4sf pad; 

6 }Jppack; /* 128-byte */ 

7 typedef Jppack * pjppack; 



The variable pad is used to make the size of the structure an exact multiple 
of 64 bytes. 



5.2 The data structure for i particles 

We use the following local variables to store two i particles (with indices io 
and ii). 



List 13. Two i particles packed into local variables. 



1 


v2df 


xiO [3] 


= {{x[iO] [0] , 


x[iO] [0]} , {. . .} , 


{...}}; 




2 


v2df 


xil [3] 


= {{x[il] [0] , 


x[il] [0]} , {. . .} , 


{...}}; 




3 


v4sf 


vi[3] 


= {{v[iO] [0] , 


v[il] [0] , v[iO] [0] 


, v[il] [0]} , {...}, {. 


.}}; 



Note that xiO keeps the position of particle io in a duplicated way (two 64- 
bit words of v2df data type store the same data). Similarly, xil keeps the 
data for ii. In this way, by subtracting x of the j particle variable from xO or 
xl, we can calculate the displacement of two j particles from one i particle. 
The results are then converted to single-precision format using a cvtpd2ps 
instruction. (See list 14). After unpcklps (an instruction to pack two words 
into one words, not to unpack) is issued, the contents of x become {xj — 

For the velocity, we store the data in the order of {ig, ii, io, ii), so that a single 
subtraction operation provides the four displacement vectors of two j particles 
from two i particles. 
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List 14. Subtraction between positions. 



1 


pjppack pj ; 










2 


v2df xi0[3] 


xil [3] ; 








3 












4 


for(j=0; j<n 


j+=2, pj++){ 








5 


v4sf X, 


y , z , f tmp ; 








6 


X = 


__builtin_ia32_ 


cvtpd2ps (pj 


->x[0] - 


xiO [0] ) ; 


7 


f tmp = 


__builtin_ia32_ 


cvtpd2ps (pj 


->x[0] - 


xil [0] ) ; 


8 


X = 


__builtin_ia32_ 


unpcklps (x , 


ftmp) ; 





5.3 Inverse square root and Newton-Raphson iteration 



We use the same NR-itcration as in the previous section. Since we do not need 
the data conversion between single and double precision formats, the code here 
actually becomes simpler. List 15 gives this part of the code. 

List 15. Calculation of inverse square root in v4sf . 

1 v4sf r2 , rinv; 

2 

3 rinv = __bui It in_ i a32_r sqr tp s (r 2 ) ; 

4 r2 *= rinv; /* start Newton-Raphson */ 

5 r2 *= rinv; 

6 V4SF_QP_SCALAR(r2, -= , 3. Of); 

7 rinv *= r2; /* nom , rinv = -2.0/r */ 



5.4 Accumulating acceleration and potential 



Before accumulation of acceleration and potential, we now have to convert 
single-precision data back to double precision. Before doing so, we need to split 
one piece of 128-bit data with 4 single-precision words into two (effectively) 
64-bit words with two single-precision words. This is done by the movhlps 
instrTiction. Then we convert the result to double precision using a cvtps2pd 
instruction. The code appears in list 16. 

List 16. Accumulating x element of acceleration. 

1 v4sf X, rinv3 , ftmp; 

2 v2df ax; 

3 

4 X *= rinv3; 

5 ftmp = __builtin_ia32_movhlps (f tmp , x) ; 

6 ax += __built in_ ia32_ c vt ps2pd (x) ; 

7 ax += __built in_ ia32_c vtps2pd (f tmp ) ; 



For the jerk, we directly accumulate them in quadruples of single-precision 
words, since we do not need double-precision accuracy for jerk. After total 
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force is obtained, we add up higher two words and lower two words of the 
accumulated data. 



5.5 The whole code 



List 17 shows the entire code. We provide a simple library which one can call 
to use this function in a way similar to the way the GRAPE hardware is used. 

We show an example iV-body code using this library in Appendix 8. Note 
that actual code in list 17 is slightly different from code in list (12-16). We 
use arrays instead of vector types in the j particle structure. Codes in list 14 
and 16 are abstracted into macros for readability and saving of registers. 

List 17. Total code of mixed precision force calculation. 



static v4sf frv.coeff = {0.75, 0.75, 0.75, 0.75}; 
static v2df PotCorrect = {-0.5, -0.5>; 
static v2df AccCorrect = {-0.125, -0.125}; 

static v4sf V4sfEps2 = {1./65536. , 1./65536. , 1./65536. , 1./65536.}; 

static v2df Epslnv = {256.0, 256.0}; 
static int Nbody; 
static pjppack Jptr; 

void SSEGRAV_Init ial ize ( int nbody, double eps2){ 

double bias = rsqrt.bias (f rsqrt.lNR) ; /* measure bias */ 
double epsinv = 1 . 0/sqrt (eps2) ; 

V2DF_DP_SCALAR (PotCorrect , =, -0 . 5* ( 1 . -bias ) ) ; 
V2DF_0P_SCALAR (AccCorrect , =, -0.125*(1.0 - 3.0*bias)); 

Eps2 = eps2 ; 

V2DF_0P_SCALAR (V2df Eps2 , =, eps2); 
V4SF_0P_SCALAR(V4sfEps2 , = , (f loat ) eps2 ) ; 
V2DF_0P_SCALAR (Epslnv , =, epsinv); 

V4SF_0P_SCALAR (frv.coeff , =, 0.75*(1.0 - 2*bias)); 
Nbody = nbody; 

Jptr = memalign(64, ( ( 1+nbody ) /2) * sizeof ( Jppack) ) ; 

} 

#define SUB_PACK(fx, ptr , xiO , xil , ftmp){ \ 

fx = __builtin_ia32_cvtpd2ps (* (v2df *) (ptr) - xiO) ; \ 
ftmp = __builtin_ia32_cvtpd2ps (*(v2df *) (ptr) - xil); \ 
fx = __builtin_ia32_unpcklps (f X , ftmp); \ 

} 

#define ACCUML ATE (phi , op, fx, ftmp){ \ 

ftmp = __built in_ ia32_mo vhlps ( f tmp , fx); \ 
phi op __built in_ ia32_ c vt ps2pd (f x) ; \ 
phi op __builtin_ia32_cvtps2pd (f tmp) ; \ 

} 

static inline v2df hadd_ps2pd ( v4sf x){ 

v4sf tmp = __builtin_ia32_movhlps (tmp , x); 

return __builtin_ia32_cvtps2pd (x) + __bui It in_ i a32_c vtps2pd ( tmp ) ; 

} 

void SSEGRAV.CalcAcc JerkPot (pAcc Jerk ajout, int *index){ 
int j, k; 
int nj = Nbody; 
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44 
45 
46 
47 
48 
49 
50 
51 
52 
53 
54 
55 
56 
57 
58 
59 
60 
61 
62 
63 
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67 
68 
69 
70 
71 
72 
73 
74 
75 
76 
77 
78 
79 
80 
81 
82 
83 
84 
85 
86 
87 
88 
89 
90 
91 
92 
93 
94 
95 
96 
97 
98 
99 
100 
101 
102 
103 
104 
105 
106 



v2df ax, ay, az , pot; 
v4sf jx, jy, jz; 

pjppack pr = Jptr , prj = Jptr; 

int iO = indexCO] , il = index[l] ; 

static v2df zero = {0.0, 0.0}; 

static v4sf fzero = {0.0, 0.0, 0.0, 0.0}; 

v2df xiO [3] , xil [3] ; 

v4sf vi [3] ; 

v4sf feps2 = V4sfEps2; 

pot = ax = ay = az = zero; 
= jy = jz = fzero; 

/* set i-particle */ 
for (k=0;k<3;k++){ 

float viO = pr [iO/2] .v[k] [2*(i0'/,2)] ; 

float vil = pr [il/2] .v[k] [2*(iiy.2)] ; 

V2DF_0P_SCALAR (xiO [k] , =, pr [ 10/2] . x [k] [( 107,2 )] ) 
V2DF_0P_SCALAR(xil [k] , =, pr [i 1/2] . x [k] [ ( i 1%2 ) ] ) 
V4SF_0P_VECT0R(vi [k] , =, viO , vil, viO , vil); 

} 

/* force loop */ 
for(j=0; j<nj ; j+ = 2 ,prj+ + ){ 

v4sf X , y , z , vx , vy , vz ; 

v4sf r2 , rv , rinv, rinv2 , rinv3; 

__builtin_pref etch (pr j+2 , 0, 3); 

builtin.pref etch (8+ (double *)(prj+2), 0, 3); 

SUB_PACK(x, prj->x[0], xi0[0], xil[0], vx); 
SUB_PACK(y, prj->x[l], xiO[l], xil[l], vy); 
SUB_PACK(z, prj->x[2], xlO [2] , xil[2], vz); 

vx = * (v4sf *) (prj ->v [0] ) - vi[0]; 
vy = * (v4sf *) (prj ->v [1] ) - vi[l]; 
vz = * (v4sf *) (prj ->v [2] ) - vi [2] ; 

r2 = x*x + y*y + z*z + feps2; 
rv = vx*x + vy*y + vz*z; 

rinv = __builtin_ia32_rsqrtps (r2) ; 
r2 *= rinv; 
r2 *= rinv; 

V4SF_0P_SCALAR(r2, -= , 3.0); 
rinv *= r2; 

rinv2 = rinv * rinv; 

rv *= frv_coeff; /* 3/4(l-2bias) */ 
rv *= rinv2 ; 

rinv *= *(v4sf *)(prj->m); 

rinv3 = rinv * rinv2 ; 

vx *= rinvS; jx += vx ; 
vy *= rinv3 ; jy += vy ; 
vz *= rinv3; jz += vz ; 

ACCUMLATE (pot , -= , rinv, vx) ; 
X *= rinv3; 
y *= rinv3; 
z *= rinv3; 
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Table 5 

Performance of list 17 in cycles-per-interaction and Gflops, on Athlon 64 3000+ 
2.0GHz, when N = 1024. The first column is performance of the output by GCC 
3.3.1 with option -03, the second is that of hand-tuned assembly code after GCC. 



GCC 


Hand-tuning 


30.6 cycles 


29.6 cycles 


3.92 Gflops 


4.05 Gflops 
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ACCUMLATE (ax , 


+= , X , vx ) ; 


108 




ACCUMLATE (ay , 


+=. y. vy) ; 


1 HQ 




ACCUMLATE (az , 


+= , z , vz) ; 


110 








111 




X *= r V ; j X 


-= X ; 


112 




y *= rv; jy 


-= y; 


113 




z *= rv ; j z 


-= z; 


114 


} 






115 








116 


/* 


post loop procedure */ 


117 


pot 


*= PotCorrect 


; /* -l/2(l-bias) */ 


118 


{ 






119 




v2df mass = {pr [iO/2] . m [2* ( ioy.2) ] , pr [i 1 /2] . m [2* ( i 1%2 ) ] } ; 


120 




pot += mass * 


Epslnv; /* phi_i += m_i/eps */ 


121 


} 






122 


ax 


*= AccCorrect ; 


/* -l/8(l-3bias) */ 


123 


ay 


*= AccCorrect ; 




124 


az 


*= AccCorrect ; 




125 


{ 






126 




v2df djx , djy 


, djz; 


127 




djx = hadd_ps2pd ( j x) * AccCorrect; 


128 




djy = hadd_ps 


2pd(jy) * AccCorrect; 
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djz = hadd_ps2pd ( jz) * AccCorrect; 



V2DF_SCATTER (ax , ajout[0].a, ajout[l].a); 
V2DF_SCATTER (ay , aj out [0] . a+1 , aj out [1] . a+1) ; 
V2DF_SCATTER(az , a j out [0] . a+2 , aj out [1] . a+2) ; 
V2DF_SCATTER (djx , ajout[0].j, ajout[l].j); 
V2DF_SCATTER(djy , a j out [0] . j +1 , aj out [1] . j + 1) ; 
V2DF_SCATTER (djz , a j out [0] . j +2 , a j out [ 1 ] . j +2 ) ; 
V2DF_SCATTER (pot , fea j out [0] . pot , &a j out [ 1] . pot ) ; 



5. 6 Performance 



Table 5 gives the performance of the code listed above. The second cohimn 
gives the performance of the code after hand-tuning the assembly output. 

The best performance we have measured is 4.05 Gflops, or 3.19 times faster 
than the original C code described in section 2. 
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6 Discussion and Summary 



In this paper we describe in detail various ways of improving the performance 
of the force-calculation loop for gravitational interactions between particles. 
Since modern microprocessors have many instructions which cannot be easily 
exploited by existing compilers, we can achieve quite a significant performance 
improvement if we write a few small library in assembly language and / or using 
instruction-set specific extensions for the compiler offered to us. 

Our implementation will give a significant speedup for almost any A^-body in- 
tegration program. In addition, we believe that similar optimization is possible 
in many other compute-intensive applications, within astrophysics as well as 
in other areas of physics and science in general. 

The source code and documentation are available at: 

ihttp : //grape . astron. s .u-tokyo . ac . jp/~nitadori/phantom/ 
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8 Sample A-body code using Hermite hierarchical time step scheme 

We show sample A-body code using Hermite hierarchical time step scheme 
and the library in subsection 5.5. 

1 #include <stdio.h> 

2 #include <stdlib.h> 

3 #iiiclude <malloc.h> 
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4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 
39 
40 
41 
42 
43 
44 
45 
46 
47 
48 
49 
50 
51 
52 
53 
54 
55 
56 
57 
58 
59 
60 
61 
62 
63 
64 
65 
66 



#define PREFETCH ( addr ) __built in_pr ef et ch ( addr , 0, 3); 
#define PREFETCHW ( addr ) __builtin_pref etch (addr , 1, 3); 
#define MEGAHELTZ 2000 

struct AccJerk{ 
double a [3] ; 

double j [3] ; 
double pot; 

}; 

struct Particle{ 

double X [3] ; 

double V [3] ; 

double a [3] ; 

double j [3] ; 

double m; 

double pot ; 

double time ; 

double timestep; 
}; /* 128-byte */ 

void CalcAcc JerkUsing_iIndex 

(int nbody , int *ilndex, int iNum , struct AccJerk *ajout) { 
int NumPipe = 2; 
int iNumRest = iNum; 

while (iNumRest ) { 

int ni = MIN(NumPipe, iNumRest); 
if(ni == 1) ilndex[l] = ilndex[0]; 

SSEGRAV_CalcAcc JerkPot (ajout , ilndex, nbody); 

iNumRest -= ni ; 
ilndex += ni ; 
ajout += ni; 

} 

} 

int GravityDriver (char *parmfile){ 
int i , k ; 
int nbody; 

double epsinv , TimeEnd , dEOutTime , MaxStep , Eta_s , SqrtEt 
double SystemTime = 0.0; 
double eps2; 
double InitEnergy ; 

FILE *fpin = NULL, *fpout = NULL; 
struct Particle *ptcl; 
struct AccJerk *ajnew; 
int *ilndex ; 

long NumPtclStep = 0, NumStep = 0; 
long tscO, tscl; 
double cycle ; 

long CyclePredict = 0, CycleForce = 

CycleCorrect = 0, CycleTot = 0; 
double WallTime; 

GetPar am (parmf i le , feepsinv , STimeEnd , feMaxStep , 

fedEOutTime , &Eta_s , ftSqrtEta , ftfpin, ftfpout); 
if (epsinv == 0.) eps2 = 0.; 
else eps2 = 1 . 0/ (epsinv*epsinv) ; 

assert(fpin != NULL); 
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assert (1 == f scanf (f pin , "%d\n" , fenbody)); 

printf ( "Nu = u'/.d\n" , nbody) ; 

ptcl = memalign(64, nbody * s izeof ( struct Particle)); 
ajnew = memalign(64, (1+nbody) * sizeof (struct AccJerk)); 
ilndex = calloc(l+nbody , sizeof (int) ) ; 

ReadSnapshot (f pin , nbody, ptcl, feSystemTime ) ; 

SSEGRAV_Initialize (nbody , eps2/*, epsinv*/); 
BSTEP.Initialize (nbody , 0.0, MaxStep); 

SSEGRAV.Predict (ptcl , SystemTime/* , nhody*/) ; 

/* initial step */ 

f or ( 1=0 ; i <nbody ; i++ ) ilndex[i] = i; 
rdtscll (tscO) ; 

CalcAccJerkUsing_iIndex (nbody , ilndex, nbody, ajnew); 
rdtscll (tscl) ; 

cycle = (tscl - tscO) / ( (double ) nbody * nbody); 
printf ("%fuCycleuperuinterruction,u7iFuGflops\n\n" , 
cycle , MEGAHELTZ/1 . e3/cycle*60 . 0) ; 

for(i=0;i<nbody ;!++){ 
for (k=0 ; k<3 ; k++) { 

ptcl[i].a[k] = a j new [i ] . a [k] ; 
ptcl[i].j[k] = ajnew [i] . j [k] ; 

} 

ptcl[i].pot = ajnew [i] . pot ; 
InitTimestep (ptcl+i , Eta_s , MaxStep); 

BSTEP_PushAParticle (i , ptcl [i] . timestep , ftptcl [i] . timestep) ; 

} 

InitEnergy = GetEnergy (ptcl , nbody); 

PrlntEnegy (ptcl , nbody, InitEnergy, SystemTime); 

/* evolve */ 
rdtscll (tscO) ; 
CycleTot -= tscO; 
CycleCorrect -= tscO; 
while(i){ 

int iNum ; 

BSTEP.PullParticles (ilndex , feiNum, &Sy stemTime ) ; 

rdtscll (tscO) ; 

CycleCorrect += tscO; 

if (SystemTime >= TimeEnd) break; 

CyclePredict -= tscO; 

SSEGRAV_Predict (ptcl , SystemTime/*, nbody*/); 
rdtscll (tscO) ; 
CyclePredict += tscO; 

CycleForce -= tscO; 

CalcAccJerkUsing_iIndex (nbody , ilndex, iNum, ajnew); 
rdtscll (tscO) ; 
CycleForce += tscO; 

CycleCorrect -= tscO; 
f or (k=0 ; k<iNum ; k++) { 

int ii = ilndex [k] ; 

char *addr = (char *) (ptcl + ilndex [k+1] ) ; 
PREFETCHW(addr) ; 
PREFETCHW(addr+64) ; 
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PREFETCH (ajnew+k+1) ; 

HermiteCorrect (ptcl+11 , ajnew+k, SqrtEta) ; 
BSTEP_PushAPart icle 

(ii, ptcl [ii] . timestep , &pt cl [ ii] . t ime st ep ) ; 

} 

if (f modCSystemTime , dEOutTime) == 0.0){ 

f or ( i=0 ; Knbody ; ptcl[i].pot = a j new [i] . pot ; 

Pr intEnegy (ptcl , nbody , InitEnergy , SystemTime); 

} 

NumPtclStep += iNum; 
NumStep++ ; 

} 

rdtscll (tscl ) ; 
CycleTot += tscl; 

WallTime = (double ) CycleTot /MEGAHELTZ / 1 . E6 ; 

printf("WalluClockutime:uy,fuSec\n" , WallTime) ; 

print f ("predict :□'/. fyusec/blockstep ,u'/.fuCycle/loop\n" , 

(double ) CyclePredict / NumStep /MEGAHELTZ , 

( double )CyclePredict/ (NumStep*nbody ) ) ; 
printfC'forceuuiu'/.fuUsec/blockstep ,u'/ofucycle/loop\n" , 

(double) CycleForce/NumStep/MEGAHELTZ , 

(double) CycleForce/ (NumPt clSt ep *nbody ) ) ; 
pr int f ("correct :u'/«fuUsec/blockstep ,u7. fuCycle/loop\n" , 

(double ) CycleCorrect /NumStep/MEGAHELTZ , 

(double )CycleCorrect/ (NumPtclStep) ) ; 
printf ( " y,f upart i cles /blockst ep \n\n " , (double ) NumPt clStep /NumStep ) : 

/* final synchronizing step */ 
SSEGRAV.Predict (ptcl , TimeEnd/*, nbody*/'); 
f or ( i=0 ; i <nbody ; 1++ ) ilndex[l] = i; 

CalcAccJerkUslng_lIndex (nbody , ilndex, nbody, ajnew); 
for(l=0;i<nbody ;!++){ 

ptcl [i]. pot = ajnew[i].pot; 

pt c 1 [ 1] . t ime St ep = TimeEnd - pt c 1 [ i ] . t ime ; 
HermiteCorrect (ptcl+i , ajnew+i, SqrtEta); 

} 

Pr intEnegy (ptcl , nbody, InitEnergy, TimeEnd); 

if(fpout) WriteSnapshot (f pout , nbody, ptcl, TimeEnd); 

SSEGRAV.Finalize () ; 
BSTEP.Finalize () ; 
free (ptcl) ; 
free (ajnew) ; 
free (ilndex) ; 

fclose(fpin) ; 
f c 1 s e ( f pout ) ; 
return 0; 



} 



int main(){ 

GravityDr iver ( " initparm " ) ; 
return 0; 

} 
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